Reddit is an American social news aggregation, web content rating, and discussion website. Registered members submit content to the site, such as links, text posts, and images, which other members then vote up or down.
Bots are very common in the Reddit community. They are created by users to respond to other comments, and a bot can perform many kinds of tasks, such as gathering information, creating memes, or simply leaving a silly message for other users.
A Parent Comment is a top-level comment that contains other comments; users usually summon bots at this level.
A Bot Comment is a second-level comment triggered by a user's call phrase.
A Child Comment is a third-level comment; users reply to the bot's comments at this level.
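The three-level structure can be sketched as a nested record; the field names and strings here are illustrative, not Reddit's actual API schema:

```python
# A minimal sketch of the three-level thread structure described above.
# All field names and example strings are made up for illustration.
thread = {
    "parent_comment": "THE KING IN THE NORTH!",   # level 1: summons the bot
    "bot_comment": {
        "text": "Ours is the fury!",              # level 2: the bot's random quote
        "child_comments": [                       # level 3: replies to the bot
            "Good bot",
            "Bad bot",
        ],
    },
}

# Number of child comments sitting under this bot comment:
print(len(thread["bot_comment"]["child_comments"]))  # → 2
```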
Bobby-B is a bot built on 64 quotes from King Robert Baratheon of Game of Thrones. It replies to parent comments with a random quote from the character.
So… I want to see the interaction between the bot and humans.
I used the praw package to scrape comments from Reddit.
posts = []
replies_code = []
clean_parent = []
botnamecol = []
bot_names = list(bots["Bot Name"])
bot_names_small = bot_names[:11]
for name in bot_names_small:
    try:
        for comment in reddit.redditor(name).comments.new(limit=1000):
            posts.append([comment.link_title, comment.link_url, comment.body,
                          comment.link_permalink, comment.ups, comment.downs,
                          comment.subreddit, comment.id, comment.link_id,
                          comment.parent_id, comment.score])
            replies_code.append(comment.id)
            botnamecol.append(name)
    except Exception:
        continue
posts = pd.DataFrame(posts, columns=['title', 'post_link', 'comment', 'link',
                                     'upvote', 'downvote', 'subreddit', 'id',
                                     'link_id', 'parent_id', 'comment karma'])
clean_sub = []
for a in posts['link_id']:
    sub = a[3:]  # strip the "t3_" prefix from the link id
    clean_sub.append([sub])
submission = pd.DataFrame(clean_sub, columns=['sub_id'])
newsubmission = []
for a in submission['sub_id']:
    selftext = reddit.submission(a).selftext  # fetch once instead of twice
    if selftext:
        newsubmission.append(selftext)
    else:
        newsubmission.append('Submission without text')
for parent_id in posts['parent_id']:
    parent_id2 = parent_id[3:]  # strip the "t1_"/"t3_" prefix
    clean_parent.append([parent_id2])
newparent = pd.DataFrame(clean_parent, columns=['clean_parent_id'])
newparentcomment = []
for a in newparent['clean_parent_id']:
    try:
        newparentcomment.append(reddit.comment(a).body)
    except Exception:
        newparentcomment.append('Same as post')
posts['parent comment'] = newparentcomment
posts['bot name'] = botnamecol
replies3 = []
for code in replies_code:
    try:
        comment = reddit.comment(code)
        comment.reply_sort = 'new'
        comment.refresh()
        replies = comment.replies
        replies.replace_more(limit=None)
        if replies:
            sub_replies = []
            for x in replies:
                sub_replies.append(x.body)
            replies3.append(' '.join(sub_replies))
        else:
            replies3.append('No replies')
    except Exception:
        replies3.append('No replies because bot does not work')
posts['replies'] = replies3
posts['post_text'] = newsubmission
posts = posts[['bot name', 'title', 'post_text', 'post_link', 'parent comment',
               'comment', 'replies', 'link', 'upvote', 'downvote',
               'comment karma', 'subreddit', 'id', 'link_id', 'parent_id']]
posts.to_csv(r"/content/drive/Shared drives/Reddit_Bot_Project/top10_final_oresentation.csv")
library(rjson)
library(dplyr)
library(jsonlite)
library(tidytext)
library(tidyverse)
library(stringr)
library(stm)
library(textstem)
library(tm)
library(ggplot2)
library(DT)
bobby_comment <- fromJSON("bobby-b-bot.comments.json")
bobby_parent <- fromJSON("bobby-b-bot.parentcomments.json")
bobby_child <- fromJSON("bobby-b-bot.childcomments.json")
jockers <- lexicon::hash_sentiment_jockers ### I decided to use Jockers as my main lexicon reference
bobby_child$parent_id <- sub("t1_","",bobby_child$parent_id) ### Remove id label
bobby_child <- bobby_child %>%
mutate(child_text = text) %>%
select(-post_id, -text)
bobby_child <- bobby_child %>% ### Concatenate all child comments that belong to the same bot comment
group_by(parent_id)%>%
summarise( replies = paste(child_text, collapse = " | ") )
bobby_comment_child <- bobby_comment %>%
inner_join(bobby_child, by = c("comment_id" = "parent_id"))
bobby_comment_child <- bobby_comment_child %>% ### Count how many child comments sit under each bot comment
mutate(repliescount = stringr::str_count(.$replies,'\\|'),
repliescount = ifelse(repliescount>0,repliescount+1,1))
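The repliescount logic above recovers the number of child comments from the " | "-joined string: n separators mean n + 1 replies, and zero separators mean a single reply. A minimal Python sketch of the same rule, with made-up strings:

```python
def replies_count(joined: str) -> int:
    """Mirror the str_count logic: n '|' separators => n + 1 replies,
    zero separators => exactly one reply."""
    pipes = joined.count("|")
    return pipes + 1 if pipes > 0 else 1

print(replies_count("good bot"))                  # → 1
print(replies_count("good bot | bad bot | lol"))  # → 3
```

One caveat worth noting: a literal "|" inside a reply would inflate the count, so the choice of join separator matters.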
bobby_comment_child$parent_id <- sub("t1_","",bobby_comment_child$parent_id)
bobby_comment_child$sub_id <- sub("t3_","",bobby_comment_child$sub_id)
bobby_parent1 <- bobby_parent %>%
mutate(parent_text = text) %>%
select(-text)
bobby_comment_child2<- bobby_comment_child %>%
left_join(bobby_parent1, by = c("parent_id" = "parent_id"))
bobby_comment_child3 <- bobby_comment_child2 %>%
select(parent_text,text,score,replies,repliescount)
bobby_comment_child4 <- bobby_comment_child3 %>% ### Clean the text data
mutate(parent_text = as.character(parent_text),
parent_text = str_replace_all(parent_text, "\n", " "),
parent_text = str_replace_all(parent_text, "(\\[.*?\\])", ""),
parent_text = str_squish(parent_text),
parent_text = gsub("([a-z])([A-Z])", "\\1 \\2", parent_text),
parent_text = tolower(parent_text),
parent_text = removeWords(parent_text, c("’", stopwords(kind = "en"))),
parent_text = removePunctuation(parent_text),
parent_text = removeNumbers(parent_text),
parent_text = textstem::lemmatize_strings(parent_text),
text = as.character(text),
text = str_replace_all(text, "\n", " "),
text = str_replace_all(text, "(\\[.*?\\])", ""),
text = str_squish(text),
text = gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "",text),
text = gsub("([a-z])([A-Z])", "\\1 \\2", text),
text = tolower(text),
text = removeWords(text, c("’", stopwords(kind = "en"))),
text = removePunctuation(text),
text = removeNumbers(text),
text = textstem::lemmatize_strings(text),
replies = as.character(replies),
replies = str_replace_all(replies, "\n", " "),
replies = str_replace_all(replies, "(\\[.*?\\])", ""),
replies = str_squish(replies),
replies = gsub("([a-z])([A-Z])", "\\1 \\2",replies),
replies = tolower(replies),
replies = removeWords(replies, c("’", stopwords(kind = "en"))),
replies = removePunctuation(replies),
replies = removeNumbers(replies),
replies = textstem::lemmatize_strings(replies)) %>%
as.data.frame()
bobby_comment_child4 <- bobby_comment_child4 %>% ### Assign id for later lexicon join
mutate(id = 1:n())
test1 <- bobby_comment_child4 %>% ### Inner join lexicon with parent comment
unnest_tokens(word,parent_text) %>%
inner_join(jockers, by = c("word" = "x"))
test1$y[is.na(test1$y)] <- 0
parent_text_t <- test1 %>%
group_by(id)%>%
summarise(parent_text_score = round(mean(y[y!=0]),2))
test2 <- bobby_comment_child4 %>% ### Inner join lexicon with bot comment
unnest_tokens(word,text) %>%
inner_join(jockers, by = c("word" = "x"))
test2$y[is.na(test2$y)] <- 0
text_t <- test2 %>%
group_by(id)%>%
summarise(text_score = round(mean(y[y!=0]),2))
test3 <- bobby_comment_child4 %>% ### Inner join lexicon with child comment
unnest_tokens(word,replies) %>%
inner_join(jockers, by = c("word" = "x"))
test3$y[is.na(test3$y)] <- 0
replies_t <- test3 %>%
group_by(id)%>%
summarise(replies_score = round(mean(y[y!=0]),2))
bobby_comment_child_score<- bobby_comment_child4%>% ### Create the new table with scores
inner_join(parent_text_t, by = c("id" = "id"))%>%
inner_join(text_t, by = c("id" = "id"))%>%
inner_join(replies_t, by = c("id" = "id"))
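The three joins above all follow the same recipe: tokenize, keep only the words found in the lexicon, and average their nonzero polarity values. A Python sketch of that scoring with a tiny made-up lexicon (the real analysis uses lexicon::hash_sentiment_jockers):

```python
# Toy stand-in for the Jockers lexicon; words and weights are invented.
toy_lexicon = {"good": 0.8, "warm": 0.5, "stupid": -0.8, "fear": -0.5}

def comment_score(text):
    # Keep only tokens that match the lexicon, then average the
    # nonzero values, as in mean(y[y != 0]) above.
    matched = [toy_lexicon[w] for w in text.lower().split() if w in toy_lexicon]
    nonzero = [v for v in matched if v != 0]
    if not nonzero:
        return None  # no lexicon words => no sentiment score
    return round(sum(nonzero) / len(nonzero), 2)

print(comment_score("a good warm king"))  # → 0.65
print(comment_score("winter is coming"))  # → None
```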
### Group by bot comment in order to filter out unqualified bot comments (e.g. the bot's system-failure message)
group_attempt <- bobby_comment_child_score %>%
group_by(text) %>%
summarise(avg_replies_score= mean(replies_score),
avg_parent_score= mean(parent_text_score),
avg_text_score= mean(text_score),
count = n()) %>%
filter(count>1) %>%
arrange(desc(avg_replies_score),desc(count))
bobby_comment_child_score2 <- bobby_comment_child_score %>%
inner_join(group_attempt, by = c("text" = "text"))
bobby_comment_child_score2 <- bobby_comment_child_score2 %>% ### Reassign the id
select(-id)%>%
mutate(id = 1:n())
Now, I am more interested in how many child comments belong to the same bot comment, because the sentiment score will be affected by how many comments there are and how many words each comment contains.
As you can see from the graph, most bot comments get one to three replies, so I decided to focus on those with more than one child comment to make the sentiment analysis more reasonable.
bobby_comment_child_score2 %>%
filter(repliescount< 10) %>%
ggplot()+
geom_histogram(aes(repliescount), bins = 10, fill = "#67C2EE", alpha=0.7 )+
theme_light() +
theme(
legend.position = "none",
panel.border = element_blank(),
) +
xlab("Number of Child Comments Under the Same Bot Comment") +
ylab("Count")
After the distribution graph, I ran a linear regression between the parent comment and child comment sentiment scores. The slope is highly significant (p < 0.001), but note the tiny R-squared: with almost 13,000 observations, even a weak relationship shows up as significant. The association is reliable but modest, so I moved on to plot it.
##
## Call:
## lm(formula = parent_text_score ~ replies_score, data = bobby_comment_child_score2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11869 -0.42500 0.03416 0.44975 0.99081
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.063938 0.004818 13.270 < 2e-16 ***
## replies_score 0.054752 0.008729 6.272 3.67e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5401 on 12869 degrees of freedom
## Multiple R-squared: 0.003048, Adjusted R-squared: 0.00297
## F-statistic: 39.34 on 1 and 12869 DF, p-value: 3.672e-10
I set up two groups to compare the slopes in different situations:
Positive Bot Comment Graph: the bot sentiment score is over 0.8, to see how the child comments interact with the parent comments.
Negative Bot Comment Graph: the bot sentiment score is below -0.8, to see how the child comments interact with the parent comments.
bobby_comment_child_score2 %>%
filter((text_score >= 0.8) & repliescount >= 2 )%>%
ggplot()+
geom_point(aes(parent_text_score,replies_score),colour = "#67C2EE",alpha = 0.7)+
geom_smooth(method = "lm", se = FALSE,aes(parent_text_score,replies_score),colour = "#FF4500", alpha=0.5, size=1.5)+
theme_light() +
theme(
legend.position = "none",
panel.border = element_blank(),plot.title = element_text(hjust=0.5),
) +
xlab("Parent Comment Score ") +
ylab("Child Comment Score")+
ggtitle("Extremely Positive Bot Comment Sentiment Score")
bobby_comment_child_score2 %>%
filter((text_score <= -0.8) & repliescount >= 2 )%>%
ggplot()+
geom_point(aes(parent_text_score,replies_score),colour = "#67C2EE",alpha = 0.7)+
geom_smooth(method = "lm", se = FALSE,aes(parent_text_score,replies_score),colour = "#FF4500", alpha=0.5, size=1.5)+
theme_light() +
theme(
legend.position = "none",
panel.border = element_blank(),plot.title = element_text(hjust=0.5),
) +
xlab("Parent Comment Score ") +
ylab("Child Comment Score")+
ggtitle("Extremely Negative Bot Comment Sentiment Score")
The Positive Bot Comment Graph shows that when the bot's sentiment score is very positive, the child comments have a positive relationship with the parent comments. The Negative Bot Comment Graph also indicates a positive correlation, with a slightly steeper slope.
Things get interesting if you look closer and compare the graphs: in the Negative Bot Comment Graph, the regression line starts below 0, while in the Positive Bot Comment Graph it starts above 0. This difference in intercepts suggests that the bot comments actually function as a buffer between the parent comment and the child comments.
We could apply this insight to chatbot software: the bot could calculate the sentiment score of a complaint and respond with sentences of carefully chosen sentiment to calm the client down.
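As a toy illustration of that idea (everything here is hypothetical: the mini-lexicon, the canned replies, and the thresholds), a support bot could score an incoming message and answer with offsetting sentiment:

```python
# Made-up mini-lexicon and canned replies; not a real chatbot API.
LEXICON = {"terrible": -0.8, "broken": -0.6, "great": 0.7, "love": 0.8}
REPLIES = {
    "soothe": "I'm sorry to hear that. Let's get it fixed together.",
    "neutral": "Thanks for the feedback!",
    "celebrate": "Fantastic, glad you love it!",
}

def pick_reply(message):
    # Average the sentiment of the lexicon words in the message.
    scores = [LEXICON[w] for w in message.lower().split() if w in LEXICON]
    score = sum(scores) / len(scores) if scores else 0.0
    if score < -0.3:   # clearly negative => buffer with a calm reply
        return REPLIES["soothe"]
    if score > 0.3:    # clearly positive => mirror the enthusiasm
        return REPLIES["celebrate"]
    return REPLIES["neutral"]

print(pick_reply("this update is terrible and broken"))
```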
In information retrieval, tf-idf (term frequency-inverse document frequency) is a numerical statistic intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in information retrieval, text mining, and user modeling. The tf-idf value increases proportionally with the number of times a word appears in the document and is offset by the number of documents in the corpus that contain the word, which adjusts for the fact that some words are simply more frequent in general.
The top 15 words that matter most at each level are listed below. It is interesting to see that the bot tends to use provocative, dramatic words that trigger users to respond. As for the child comments, sentiment words rank highly, which suggests the bot comments come across as very human-like.
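The definition above can be made concrete with a hand-rolled example over three tiny "documents" (one per comment level), mirroring the computation in the code that follows; the words are made up:

```python
import math

# One tiny token list per comment level.
docs = {
    "parent": "king king north".split(),
    "bot":    "king wine fury".split(),
    "child":  "good bot good".split(),
}

def tf_idf(word, level):
    # tf: share of the level's tokens that are this word
    tf = docs[level].count(word) / len(docs[level])
    # idf: log(number of levels / number of levels containing the word)
    containing = sum(word in tokens for tokens in docs.values())
    idf = math.log(len(docs) / containing)
    return tf * idf

print(round(tf_idf("king", "parent"), 3))  # "king" appears in 2 of 3 levels
print(round(tf_idf("wine", "bot"), 3))     # "wine" is unique to the bot level
```

Words shared by every level get idf = log(1) = 0 and drop out, which is exactly why the distinctive words surface at the top of each level's list.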
### Only keep words with more than two characters
bobby_comment_child3$parent_text<-gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$parent_text)
bobby_comment_child3$text <- gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$text)
bobby_comment_child3$text <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+","",bobby_comment_child3$text)
bobby_comment_child3$replies <-gsub('\\b\\w{1,2}\\b','',bobby_comment_child3$replies)
parent <- paste(bobby_comment_child3$parent_text, collapse = ",")
bot_text <- paste(bobby_comment_child3$text ,collapse = ",")
child <- paste(bobby_comment_child3$replies,collapse = ",")
text_data = data.frame(level = c("parent", "bot_text", "child"),
text = c(tolower(parent), tolower(bot_text),
tolower(child)),
stringsAsFactors = FALSE)
textTF = text_data %>%
split(., .$level) %>%
lapply(., function(x) {
textTokens = tm::MC_tokenizer(x$text)
tokenCount = as.data.frame(summary(as.factor(textTokens), maxsum = 1000))
total = length(textTokens)
tokenCount = data.frame(count = tokenCount[[1]],
word = row.names(tokenCount),
total = total,
level = x$level,
row.names = NULL)
return(tokenCount)
})
textTF = do.call("rbind", textTF)
textTF$tf = textTF$count/textTF$total
### idf
idfDF = textTF %>%
group_by(word) %>%
count() %>%
mutate(idf = log((length(unique(textTF$level)) / n)))
### tf-idf
tfidfData = merge(textTF, idfDF, by = "word")
tfidfData$tfIDF = tfidfData$tf * tfidfData$idf
### top 15
tfidfData %>%
group_by(level) %>%
arrange(level, desc(tfIDF)) %>%
slice(1:15) %>%
rmarkdown::paged_table()
N-grams are used extensively in text mining and natural language processing. They are basically sets of co-occurring words within a given window; when computing the n-grams, you typically move one word forward at a time. In short, they are a fun way to see how the words are structured in the bot comments. It seems like Bob needs mooooooooar wine!
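The same sliding-window bigram count can be sketched in a few lines of Python (toy sentence, not the real data):

```python
from collections import Counter

def bigrams(text):
    # Slide a two-word window across the text, moving one word at a time.
    words = text.lower().split()
    return Counter(zip(words, words[1:]))

print(bigrams("the king wants wine the king").most_common(1))
# → [(('the', 'king'), 2)]
```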
### N-Gram ####
bigrams = bobby_comment_child4 %>%
mutate_all(~ifelse(. %in% c("N/A", "null", ""), NA, .)) %>%
na.omit() %>%
unnest_tokens(., ngrams, text, token = "ngrams", n = 2) %>%
tidyr::separate(ngrams, c("word1", "word2"), sep = "\\s") %>%
count(word1, word2, sort = TRUE)
rmarkdown::paged_table(bigrams)
Topic modeling is an unsupervised machine learning technique that can scan a set of documents, detect word and phrase patterns within them, and automatically cluster the word groups and similar expressions that best characterize those documents.
The graph below shows how we decide how many topics (k) there are within the documents. The four plots help us determine the best number of topics. I focus on semantic coherence (how well the words hang together, computed from a conditional probability score over frequent words) and the residuals (the difference between the observed value of the dependent variable and the predicted value). We want low residuals and high semantic coherence. The residuals definitely take a sharp dive as we increase K. I decided to use k = 6, indicating that there are six topics within the bot comment section.
I found that this website introduces topic models pretty well!
set.seed(1001)
bobby_comment_child_score3 <- bobby_comment_child_score2
bobby_comment_child_score3$text <- gsub('\\b\\w{1}\\b','',bobby_comment_child_score3$text)
bobby_comment_child_score3$text <- gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", bobby_comment_child_score3$text)
holdoutRows = sample(1:nrow(bobby_comment_child_score3), 100, replace = FALSE)
bobbytidy = textProcessor(documents = bobby_comment_child_score3$text[-c(holdoutRows)],
metadata = bobby_comment_child_score3[-c(holdoutRows), ],
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
bobbyPrep = prepDocuments(documents = bobbytidy$documents,
vocab = bobbytidy $vocab,
meta = bobbytidy $meta)
kTest = searchK(documents = bobbyPrep$documents,
vocab = bobbyPrep$vocab,
K = c(3, 4, 5,10), verbose = FALSE)
plot(kTest)
The following shows the proportion of each topic, along with a few of its highest-probability words.
topics6 = stm(documents = bobbyPrep$documents,
vocab = bobbyPrep$vocab, seed = 1001,
K = 6, verbose = FALSE)
plot(topics6)
I generated six plots with the child comment sentiment score on the x axis to see how people react to different topics in the bot comments.
In the first two graphs, the topics show a positive relationship with the sentiment score. Words like good, warm, and honor do seem likely to increase the sentiment score as the topic takes up a larger proportion. Surprisingly, pregnant, breed, and child also help increase the sentiment score.
finalbobbytidy = textProcessor(documents = bobby_comment_child_score3$text,
metadata = bobby_comment_child_score3,
stem = FALSE)
## Building corpus...
## Converting to Lower Case...
## Removing punctuation...
## Removing stopwords...
## Removing numbers...
## Creating Output...
finalbobbyPrep = prepDocuments(documents =finalbobbytidy$documents,
vocab = finalbobbytidy$vocab,
meta = finalbobbytidy $meta)
topicPredictor = stm(documents = finalbobbyPrep$documents,
vocab = finalbobbyPrep$vocab, prevalence = ~ replies_score,
data = finalbobbyPrep$meta, K = 6, verbose = FALSE)
scoreEffect = estimateEffect(1:6 ~ replies_score, stmobj = topicPredictor,
metadata = finalbobbyPrep$meta)
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 1, labeltype = "frex")
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 2, labeltype = "frex")
Compared with the others, these two topics show a negative relationship. As words like stupid, pay, and fear are mentioned more, it is reasonable to assume the sentiment score will decrease. Also, it is funny to see the score go down when Ned Stark is mentioned multiple times.
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 3, labeltype = "frex")
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 5, labeltype = "frex")
The following graphs are the most interesting ones. When the seven kingdoms and kill are mentioned, people tend to dislike it. On the right-hand side, Bessie is Robert Baratheon's mistress, and when she is mentioned multiple times, people react very positively.
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 4, labeltype = "frex")
plot.estimateEffect(scoreEffect, "replies_score", method = "continuous",
model = topicPredictor, topics = 6, labeltype = "frex")
The following table is the data set, and you can play around with it!
bobby_comment_child_score4 <- bobby_comment_child_score3 %>%
select(parent_text,text,replies,parent_text_score,text_score,replies_score,score)
colnames(bobby_comment_child_score4)[1] <- "Parent Comment"
colnames(bobby_comment_child_score4)[2] <- "Bot Comment"
colnames(bobby_comment_child_score4)[3] <- "Child Comment"
colnames(bobby_comment_child_score4)[4] <- "Parent Comment Sentiment Score"
colnames(bobby_comment_child_score4)[5] <- "Bot Comment Sentiment Score"
colnames(bobby_comment_child_score4)[6] <- "Child Comment Sentiment Score"
colnames(bobby_comment_child_score4)[7] <- "Bot Comment Score"
datatable(bobby_comment_child_score4, rownames = FALSE, filter="top", options = list(pageLength = 10, scrollX=T) )
Thank You
A work by Michael Ma